fix: auto-reject PROGRAM messages with non-dict metadata #1137
Some PROGRAM messages slipped past validation while ExecutableContent.metadata accepted lists. The current validator requires a dict, so reading those rows fails parsed_content and surfaces as 500s on GET /messages/<hash>. Move them to REJECTED at startup so the API reports them the same way as nodes that rejected them in the first place. The transition logic is also exposed through a deployment/scripts helper for ad-hoc cleanups when waiting for a restart is not an option.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
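The selection rule behind this commit can be sketched in plain Python. The PR summary states that the real query uses `jsonb_typeof(content->'metadata') = 'array'` in PostgreSQL; the version below only approximates that check on already-decoded content. `jsonb_type_of` and `needs_rejection` are hypothetical names for illustration, not functions from the pyaleph codebase.

```python
from typing import Any, Dict


def jsonb_type_of(value: Any) -> str:
    """Approximate PostgreSQL's jsonb_typeof() for a decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {type(value).__name__}")


def needs_rejection(message_type: str, content: Dict[str, Any]) -> bool:
    """Return True only for PROGRAM messages whose metadata is a list.

    A missing "metadata" key is treated like null here. In SQL the
    comparison would yield NULL instead, which also selects nothing.
    """
    if message_type != "PROGRAM":
        return False
    return jsonb_type_of(content.get("metadata")) == "array"
```

Dict and null metadata are left alone, which matches the scenarios listed in the test plan (list rejected, dict/None preserved, non-PROGRAM types ignored).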
foxpatch-aleph
left a comment
Clean, well-structured fix for a production bug where PROGRAM messages with list-typed metadata cause 500s via parsed_content. Implements a reusable rejection utility for processed messages, wires a repair function into startup, and ships a companion CLI script. Thorough test coverage and good code quality throughout.
src/aleph/repair.py (line 69): Consider using session.execute(delete_vm_updates(...)) instead of _ = list(...) to avoid loading results into memory and make the intent clearer. The list() is needed to force execution, but a comment explaining why would help maintainers.
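The point about forcing execution can be illustrated with a generic lazy iterator. This is a sketch under an assumption: it supposes the helper yields its work lazily, which the review comment implies but the diff shown here does not confirm. `lazy_deletes` and `consume` are hypothetical names; `collections.deque(..., maxlen=0)` is a standard-library idiom for draining an iterator without keeping its results.

```python
from collections import deque
from typing import Iterable, Iterator, List


def lazy_deletes(hashes: List[str], deleted: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a helper that does its work lazily.

    Nothing is appended to `deleted` until the generator is iterated.
    """
    for item_hash in hashes:
        deleted.append(item_hash)  # side effect happens only on iteration
        yield item_hash


def consume(iterable: Iterable[object]) -> None:
    """Drain an iterable for its side effects without storing the results."""
    deque(iterable, maxlen=0)
```

Calling `lazy_deletes(...)` alone performs no deletions; wrapping it in `list(...)` or `consume(...)` does. `consume` states the intent directly and avoids building a throwaway list.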
mypy on CI flagged the str args. Pick a hex-valid prefix for the good sample hash too so ItemHash() does not reject it at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
foxpatch-aleph
left a comment
A well-structured fix for PROGRAM messages with invalid list-typed metadata that cause 500 errors. The approach is correct: repair_node rejects them at startup, a standalone script handles ad-hoc cases, and both properly clean up VM rows and cascade to account_costs via FK. The race condition between the initial query and per-hash processing is properly handled with a re-check. Tests cover the main scenarios (list/dict/null metadata, non-PROGRAM types, empty DB). No bugs or security issues found.
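The transition, the re-check and the FK cascade mentioned in this review can be sketched against an in-memory SQLite database. Everything below is an assumption made for illustration: the table layout is heavily simplified, SQLite stands in for PostgreSQL, and `open_db` and `reject_processed_message` are hypothetical names rather than the PR's actual code.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE messages (
    item_hash TEXT PRIMARY KEY, type TEXT NOT NULL, content TEXT NOT NULL);
CREATE TABLE message_status (item_hash TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE rejected_messages (item_hash TEXT PRIMARY KEY, reason TEXT NOT NULL);
CREATE TABLE message_confirmations (
    id INTEGER PRIMARY KEY,
    item_hash TEXT NOT NULL REFERENCES messages(item_hash) ON DELETE CASCADE);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FKs by default
    conn.executescript(SCHEMA)
    return conn


def reject_processed_message(
    conn: sqlite3.Connection, item_hash: str, reason: str
) -> bool:
    """Move one processed message to REJECTED; return False if nothing to do."""
    row = conn.execute(
        "SELECT type, content FROM messages WHERE item_hash = ?", (item_hash,)
    ).fetchone()
    # Re-check right before acting: the row may have been removed or may no
    # longer match since the initial selection query ran.
    if row is None:
        return False
    msg_type, content = row
    if msg_type != "PROGRAM" or not isinstance(
        json.loads(content).get("metadata"), list
    ):
        return False
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO rejected_messages (item_hash, reason) "
            "VALUES (?, ?)",
            (item_hash, reason),
        )
        conn.execute(
            "UPDATE message_status SET status = 'REJECTED' WHERE item_hash = ?",
            (item_hash,),
        )
        # Deleting the parent row cascades to message_confirmations.
        conn.execute("DELETE FROM messages WHERE item_hash = ?", (item_hash,))
    return True
```

Running the function twice on the same hash is safe in this sketch: the second call finds no `messages` row and returns `False`.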
deployment/scripts/reject_processed_messages.py (line 256): Minor: changed is incremented in both the --commit path and the dry-run path, so the summary count is not strictly 'changed' in the commit sense. Consider using two separate counters or mentioning 'processed' in the count label.
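One way to act on this suggestion is to track separate counters for committed changes, dry-run candidates and skipped hashes. The sketch below is hypothetical: `Summary`, `process`, `is_affected` and `apply_change` are names chosen for illustration and do not come from the script under review.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Summary:
    committed: int = 0     # rows actually changed (--commit)
    would_change: int = 0  # rows that a dry run reports but leaves untouched
    skipped: int = 0       # hashes that did not need any change


def process(
    hashes: Iterable[str],
    is_affected: Callable[[str], bool],
    apply_change: Callable[[str], None],
    commit: bool,
) -> Summary:
    summary = Summary()
    for item_hash in hashes:
        if not is_affected(item_hash):
            summary.skipped += 1
        elif commit:
            apply_change(item_hash)
            summary.committed += 1
        else:
            summary.would_change += 1
    return summary
```

With this split, the final report can print "committed" and "would change" separately, so a dry-run total is never mistaken for persisted work.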
Summary
Some PROGRAM messages slipped past validation while `ExecutableContent.metadata` accepted lists. The current validator requires a dict, so reading those rows fails `parsed_content` and surfaces as 500s on `GET /api/v0/messages/<hash>` (e.g. 42a4a8...3d96f3 returns 500, while the same hash on epyc properly reports the message as rejected).

This change:
- Adds `mark_processed_message_as_rejected` in `aleph.repair`. It mirrors `mark_pending_message_as_rejected` but starts from a `MessageDb` row instead of a `PendingMessageDb`: it cleans up VM rows for program/instance, upserts `rejected_messages`, flips `message_status` to REJECTED, and deletes the `messages` row. The trigger keeps `message_counts` consistent; FK cascades clean `message_confirmations` and `account_costs`.
- Adds `_reject_invalid_program_metadata` and wires it into `repair_node` so the API rejects affected PROGRAM messages on every startup. The query uses `jsonb_typeof(content->'metadata') = 'array'`; an empty result is a no-op.
- Adds `deployment/scripts/reject_processed_messages.py` for ad-hoc cleanups when a restart is not an option. Dry-run by default, `--commit` to persist; targets specific hashes via `--hash` / `--hashes-file`. Runs from inside the API container against the deployed config at `/var/pyaleph/config.yml`.

Test plan
- `venv/bin/python -m pytest tests/test_repair.py -v`: 5 tests, all pass (rejects list metadata, preserves dict/None metadata, ignores non-program types, no-op on empty DB).
- `venv/bin/python -m pytest tests/db/test_messages.py tests/db/test_credit_balances.py`: adjacent suites still pass (63 total).
- `venv/bin/ruff check` + `black` + `isort` clean on changed files.
- `GET /messages/<hash>` no longer 500s.

🤖 Generated with Claude Code
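For readers who want a picture of the ad-hoc script's interface, the flags named in the summary (`--commit`, `--hash`, `--hashes-file`, dry-run by default) could be parsed roughly as follows. This is a sketch only: the summary names the flags but not their exact semantics, so repeating `--hash`, the one-hash-per-line file format, the `#` comment handling and the de-duplication are all assumptions, and `build_parser` and `collect_hashes` are hypothetical names.

```python
import argparse
from pathlib import Path
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Reject already-processed messages (dry-run by default)."
    )
    parser.add_argument(
        "--hash", dest="hashes", action="append", default=None,
        metavar="ITEM_HASH", help="item hash to reject; may be repeated",
    )
    parser.add_argument(
        "--hashes-file", type=Path, default=None,
        help="file with one item hash per line (assumed format)",
    )
    parser.add_argument(
        "--commit", action="store_true",
        help="persist the changes; without it nothing is written",
    )
    return parser


def collect_hashes(args: argparse.Namespace) -> List[str]:
    """Merge --hash values and --hashes-file lines, keeping first-seen order."""
    hashes = list(args.hashes or [])
    if args.hashes_file is not None:
        for line in args.hashes_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):  # skip blanks and comments
                hashes.append(line)
    return list(dict.fromkeys(hashes))  # de-duplicate, preserving order
```

Using `store_true` for `--commit` makes the safe path the default: a run with no flag can only report what it would do.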